Change “your name” in the YAML header above to your name.
As usual, enter the examples in code chunks and run them, unless told otherwise.
Read R4ds Chapter 10: Tibbles, sections 1-3.
Load the tidyverse package.
library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.1.0     v purrr   0.2.5
v tibble  1.4.2     v dplyr   0.7.8
v tidyr   0.8.2     v stringr 1.3.1
v readr   1.3.0     v forcats 0.3.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Enter your code chunks for Section 10.2 here.
Describe what each code chunk does.
as_tibble(iris)
as_tibble() converts a regular data frame, here iris, to a tibble. A tibble is like an improved form of a data.frame.
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y)
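tibble() builds a tibble from individual vectors. It recycles inputs of length 1 (y = 1) and lets later columns refer to earlier ones (z uses x and y). It does not change the type of the inputs or the names of variables, and it never creates row names.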
tb <- tibble(
`:)` = "smile",
` ` = "space",
`2000` = "number"
)
tb
Backticks allow you to refer to variables that have non-syntactic names.
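As a minimal sketch of the point above (the tb tibble is the one from the chunk; the two extraction calls are illustrative additions), a non-syntactic name needs backticks with $, while [[ takes the name as an ordinary string:

```r
library(tibble)

# Same tibble as in the chunk above, with three non-syntactic column names.
tb <- tibble(
  `:)` = "smile",
  ` ` = "space",
  `2000` = "number"
)

# With $, the name must be wrapped in backticks.
smile <- tb$`:)`

# With [[, the name is a plain string, so no backticks are needed.
number <- tb[["2000"]]

print(smile)
print(number)
```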
tribble(
~x, ~y, ~z,
#--|--|----
"a", 2, 3.6,
"b", 1, 8.5
)
tribble() is short for transposed tibble. It is customized for data entry in code: column headings are defined by formulas (they start with ~) and entries are separated by commas, which makes a small table easy to read.
Enter your code chunks for Section 10.3 here.
Describe what each code chunk does.
tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE))
Printing a tibble shows only the first ten rows and only the columns that fit on the screen, along with the type of each column. This keeps large data sets from flooding the console and makes them easier to work with.
nycflights13::flights %>%
print(n = 10, width = Inf)
This allows you to print the data frame explicitly and control the number of rows (n) and the width of the display. width = Inf shows all the columns.
nycflights13::flights %>%
View()
The View() function (with a capital V) opens the whole data set in a scrollable table in the RStudio viewer.
df <- tibble(
x = runif(5),
y = rnorm(5))
df$x
[1] 0.72324706 0.04123625 0.97885578 0.71327709
[5] 0.83193281
df[["x"]]
[1] 0.72324706 0.04123625 0.97885578 0.71327709
[5] 0.83193281
df[[1]]
[1] 0.72324706 0.04123625 0.97885578 0.71327709
[5] 0.83193281
The $ and [[ operators extract a single variable. $ extracts by name only; [[ extracts by name or by position.
df %>% .$x
[1] 0.72324706 0.04123625 0.97885578 0.71327709
[5] 0.83193281
df %>% .[["x"]]
[1] 0.72324706 0.04123625 0.97885578 0.71327709
[5] 0.83193281
This also extracts a single variable, but inside a pipe, where the placeholder “.” stands for the data frame being piped in.
Answer the questions completely. Use code chunks, text, or both, as necessary.
mtcars
as_tibble(mtcars)
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
class(mtcars)
[1] "data.frame"
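1: How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame). Identify at least two ways to tell if an object is a tibble. Hint: What does as_tibble() do? What does class() do? What does str() do? Printing is the first clue: a tibble prints a header such as “# A tibble: 32 x 11”, shows only the first ten rows and the columns that fit on the screen, lists the type of each column under its name, and uses row numbers instead of row names. Printing mtcars, a regular data frame, shows all 32 rows with the car names as row names. The class is the second clue: class() returns "data.frame" for mtcars, but for as_tibble(mtcars), which converts the data frame to a tibble, it returns "tbl_df" "tbl" "data.frame". str() lists the structure of the object, and its first line reports 'data.frame' for mtcars.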
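The class check can also be done in code. This is a small sketch rather than part of the assigned answer; mt_tbl is a name introduced here for illustration, and is_tibble() is the tibble package's own test function.

```r
library(tibble)

# Convert the built-in data frame to a tibble.
mt_tbl <- as_tibble(mtcars)

# is_tibble() answers the question directly.
is_tibble(mtcars)   # FALSE: a plain data frame
is_tibble(mt_tbl)   # TRUE

# class() shows the tibble classes layered on top of data.frame.
class(mtcars)
class(mt_tbl)
```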
2: Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviours cause you frustration?
df <- data.frame(abc = 1, xyz = "a")
df$x
[1] a
Levels: a
df[, "xyz"]
[1] a
Levels: a
df[, c("abc", "xyz")]
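With a data frame, df$x partially matches the name and silently returns the xyz column, so a typo can return the wrong column without any error. df[, "xyz"] returns a vector (here a factor, because data.frame() converted the string), while df[, c("abc", "xyz")] returns a data frame, so the type of the result depends on how many columns are selected. A tibble never does partial matching, never converts strings to factors, and [ always returns another tibble.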
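The comparison can be run side by side. In this sketch the object names tbl, df_partial, tbl_partial, df_one, and tbl_one are introduced for illustration only; suppressWarnings() hides the warning a tibble gives for an unknown column.

```r
library(tibble)

df  <- data.frame(abc = 1, xyz = "a")
tbl <- tibble(abc = 1, xyz = "a")

# Data frame: $ partially matches "x" to the column xyz.
df_partial <- df$x

# Tibble: no partial matching, so the result is NULL (with a warning).
tbl_partial <- suppressWarnings(tbl$x)

# Data frame: selecting one column with [ drops down to a vector.
df_one <- df[, "xyz"]

# Tibble: [ always returns another tibble.
tbl_one <- tbl[, "xyz"]

is.data.frame(df_one)   # FALSE
is.data.frame(tbl_one)  # TRUE
```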
Read R4ds Chapter 11: Data Import, sections 1, 2, and 5.
Nothing to do here unless you took a break and need to reload tidyverse.
Do not run the first code chunk of this section, which begins with heights <- read_csv("data/heights.csv"). You do not have that data file so the code will not run.
Enter and run the remaining chunks in this section.
read_csv("a,b,c
1,2,3
4,5,6")
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")
read_csv("1,2,3\n4,5,6", col_names = FALSE)
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
read_csv("a,b,c\n1,2,.", na = ".")
1: What function would you use to read a file where fields were separated with “|”? read_delim("file.csv", delim = "|") is the function you would use to read the file with entries separated with “|”.
2: (This question is modified from the text.) Finish the two lines of read_delim code so that the first one would read a comma-separated file and the second would read a tab-separated file. You only need to worry about the delimiter. Do not worry about other arguments. Replace the dots in each line with the rest of your code.
file <- read_delim("file.csv", delim = ",")
file <- read_delim("file.csv", delim = "\t")
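3: What are the two most important arguments to read_fwf()? Why? col_positions and col_types are the most important arguments to read_fwf(). A fixed-width file has no delimiters, so col_positions has to tell the function where each column begins and ends (the helpers fwf_widths() and fwf_positions() can also name the columns), and col_types determines the type of data in each column.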
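A short sketch of those two arguments in use. The file contents, the column names name and age, and the widths 6 and 2 are all made up for illustration; the data is written to a temporary file so the example is self-contained.

```r
library(readr)

# Hypothetical fixed-width data: name occupies 6 characters, age the next 2.
path <- tempfile(fileext = ".txt")
writeLines(c("John  25", "Mary  31"), path)

people <- read_fwf(
  path,
  # Where each column sits, and what to call it.
  col_positions = fwf_widths(c(6, 2), col_names = c("name", "age")),
  # What type each column should be parsed as.
  col_types = cols(name = col_character(), age = col_integer())
)

people
```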
4: Skip this question
5: Identify what is wrong with each of the following inline CSV files. What happens when you run the code?
read_csv("a,b\n1,2,3\n4,5,6")
2 parsing failures.
row col expected actual file
1 -- 2 columns 3 columns literal data
2 -- 2 columns 3 columns literal data
read_csv("a,b,c\n1,2\n1,2,3,4")
2 parsing failures.
row col expected actual file
1 -- 3 columns 2 columns literal data
2 -- 3 columns 4 columns literal data
read_csv("a,b\n\"1")
2 parsing failures.
row col expected actual file
1 a closing quote at end of file literal data
1 -- 2 columns 1 columns literal data
read_csv("a,b\n1,2\na,b")
read_csv("a;b\n1;3")
“Parsing failures” is the warning associated with the first three files. In the first file the header defines two columns but each data row has three values, so the third value in each row is left out of the table. In the second file the header defines three columns, but the first row has only two values (c is filled with NA) and the second row has four (the extra value is dropped). In the third file the opening quote is never closed and the row has only one value for two columns, so b is NA. The fourth file reads without failures, but each column mixes a number with a letter, so both columns are read as character; the second data row repeats the header names, which suggests the data was improperly recorded. The fifth file is separated with semicolons, but read_csv() splits on commas, so everything ends up in a single column named “a;b”. read_csv2() or read_delim() with delim = ";" would read it correctly.
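The semicolon case can be checked directly. This is a minimal sketch using the same inline data as the last chunk; one_col and two_cols are names introduced here.

```r
library(readr)

# read_csv() splits on commas only, so everything lands in one column.
one_col <- read_csv("a;b\n1;3")
ncol(one_col)

# Telling read_delim() that the delimiter is ";" gives two columns.
two_cols <- read_delim("a;b\n1;3", delim = ";")
ncol(two_cols)
```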
Just read this section. You may find it helpful in the future to save a data file to your hard drive. It is basically the same format as reading a file, except that you must specify the data object to save, in addition to the path and file name.
Read R4ds Chapter 18: Pipes, sections 1-3.
Nothing to do otherwise for this chapter. Is this easy or what?
Note: Try using pipes for all of the remaining examples. That will help you understand them.
Read R4ds Chapter 12: Tidy Data, sections 1-3, 7.
Nothing to do here unless you took a break and need to reload the tidyverse.
Study Figure 12.1 and relate the diagram to the three rules listed just above them. Relate that back to the example I gave you in the notes. Bear this in mind as you make data tidy in the second part of this assignment.
You do not have to run any of the examples in this section.
Read and run the examples through section 12.3.1 (gathering), including the example with left_join(). We’ll cover joins later.
table4a
table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
table4b %>%
gather(`1999`, `2000`, key = "year", value = "population")
tidy4a <- table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>%
gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)
Joining, by = c("country", "year")
table2
table2 %>%
spread(key = type, value = count)
2: Why does this code fail? Fix it so it works.
table4a %>%
gather(1999, 2000, key = "year", value = "cases")
#> Error in inds_combine(.vars, ind_list): Position must be between 0 and n
table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
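There must be backticks surrounding the years. Without them, gather() treats 1999 and 2000 as column positions rather than column names, and table4a does not have that many columns.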
That is all for Chapter 12. On to the last chapter.
Read R4ds Chapter 5: Data Transformation, sections 1-4.
Time to get small.
Load the necessary libraries. As usual, type the examples into and run the code chunks.
library(nycflights13)
package ‘nycflights13’ was built under R version 3.5.2
library(tidyverse)
nycflights13::flights
flights
filter(flights, month == 1, day == 1)
jan1 <- filter(flights, month == 1, day == 1)
(dec25 <- filter(flights, month == 12, day == 25))
#> # A tibble: 719 x 19
filter(flights, month == 1)
sqrt(2) ^ 2 == 2
[1] FALSE
near(sqrt(2) ^ 2, 2)
[1] TRUE
near(1 / 49 * 49, 1)
[1] TRUE
filter()
Study Figure 5.1 carefully. Once you learn the &, |, and ! logic, you will find them to be very powerful tools.
filter(flights, month == 11 | month == 12)
nov_dec <- filter(flights, month %in% c(11, 12))
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
NA > 5
[1] NA
10 == NA
[1] NA
NA + 10
[1] NA
NA / 2
[1] NA
NA == NA
[1] NA
x <- NA
y <- NA
x == y
[1] NA
is.na(x)
[1] TRUE
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
1.1: Find all flights with a delay of 2 hours or more.
filter(flights, arr_delay >= 120 | dep_delay >= 120)
1.2: Flew to Houston (IAH or HOU)
filter(flights, dest == 'IAH' | dest == 'HOU')
1.3: Were operated by United (UA), American (AA), or Delta (DL).
filter(flights, carrier == 'UA' | carrier == 'AA' | carrier == 'DL')
1.4: Departed in summer (July, August, and September).
filter(flights, month >= 7 & month <= 9)
1.5: Arrived more than two hours late, but didn’t leave late.
filter(flights, arr_delay > 120, dep_delay <= 0)
1.6: Were delayed by at least an hour, but made up over 30 minutes in flight. This is a tricky one. Do your best.
filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)
1.7: Departed between midnight and 6am (inclusive)
filter(flights, dep_time <= 600 | dep_time == 2400)
2: Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
filter(flights, between(month, 7, 9))
filter(flights, !between(dep_time, 601, 2359))
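between(x, left, right) keeps the values of x that lie between left and right, including both end points; it is shorthand for x >= left & x <= right. It does simplify the last two questions!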
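A quick check of that equivalence on a made-up vector (x is illustrative, not taken from flights):

```r
library(dplyr)

x <- c(6, 7, 8, 9, 10)

# between(x, left, right) is shorthand for x >= left & x <= right,
# so both end points are included.
between(x, 7, 9)
x >= 7 & x <= 9
```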
3: How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
summary(flights)
year month day
Min. :2013 Min. : 1.000 Min. : 1.00
1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00
Median :2013 Median : 7.000 Median :16.00
Mean :2013 Mean : 6.549 Mean :15.71
3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00
Max. :2013 Max. :12.000 Max. :31.00
dep_time sched_dep_time dep_delay
Min. : 1 Min. : 106 Min. : -43.00
1st Qu.: 907 1st Qu.: 906 1st Qu.: -5.00
Median :1401 Median :1359 Median : -2.00
Mean :1349 Mean :1344 Mean : 12.64
3rd Qu.:1744 3rd Qu.:1729 3rd Qu.: 11.00
Max. :2400 Max. :2359 Max. :1301.00
NA's :8255 NA's :8255
arr_time sched_arr_time arr_delay
Min. : 1 Min. : 1 Min. : -86.000
1st Qu.:1104 1st Qu.:1124 1st Qu.: -17.000
Median :1535 Median :1556 Median : -5.000
Mean :1502 Mean :1536 Mean : 6.895
3rd Qu.:1940 3rd Qu.:1945 3rd Qu.: 14.000
Max. :2400 Max. :2359 Max. :1272.000
NA's :8713 NA's :9430
carrier flight tailnum
Length:336776 Min. : 1 Length:336776
Class :character 1st Qu.: 553 Class :character
Mode :character Median :1496 Mode :character
Mean :1972
3rd Qu.:3465
Max. :8500
origin dest air_time
Length:336776 Length:336776 Min. : 20.0
Class :character Class :character 1st Qu.: 82.0
Mode :character Mode :character Median :129.0
Mean :150.7
3rd Qu.:192.0
Max. :695.0
NA's :9430
distance hour minute
Min. : 17 Min. : 1.00 Min. : 0.00
1st Qu.: 502 1st Qu.: 9.00 1st Qu.: 8.00
Median : 872 Median :13.00 Median :29.00
Mean :1040 Mean :13.18 Mean :26.23
3rd Qu.:1389 3rd Qu.:17.00 3rd Qu.:44.00
Max. :4983 Max. :23.00 Max. :59.00
time_hour
Min. :2013-01-01 05:00:00
1st Qu.:2013-04-04 13:00:00
Median :2013-07-03 10:00:00
Mean :2013-07-03 05:22:54
3rd Qu.:2013-10-01 07:00:00
Max. :2013-12-31 23:00:00
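8255 flights have a missing dep_time, and the same number are missing dep_delay. arr_time is missing for 8713 flights, and arr_delay and air_time are each missing for 9430. Rows with no departure time most likely represent cancelled flights. The additional rows that have a departure time but no arrival delay or air time may be flights that were diverted or whose arrival was never recorded.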
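The NA counts can also be computed directly instead of read off summary(). The sketch below uses a small made-up table (toy) as a stand-in for flights; the same colSums(is.na(...)) call should work on flights itself and also covers character columns, for which summary() prints no NA count.

```r
# Hypothetical miniature of the flights table; NA marks a missing value.
toy <- data.frame(
  dep_time = c(517, NA, 542, NA),
  arr_time = c(830, NA, 923, NA),
  air_time = c(227, NA, NA, NA)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))
na_counts
```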
4: Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
NA ^ 0
[1] 1
NA | TRUE
[1] TRUE
FALSE & NA
[1] FALSE
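NA ^ 0 is 1 because any number raised to the power 0 is 1, whatever the missing value is. NA | TRUE is TRUE because an “or” is TRUE as soon as one side is TRUE. FALSE & NA is FALSE because an “and” is FALSE as soon as one side is FALSE. The general rule seems to be that the result is not missing whenever it would be the same for every possible value of the NA. NA * 0 is the counterexample: the answer would be 0 for any finite number, but Inf * 0 is NaN, so the result depends on the unknown value and stays NA.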
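The rule can be tried out in the console. This sketch only uses base R; the comments give the reasoning for each line.

```r
# Not missing: the answer is the same for every possible value of NA.
NA ^ 0       # any number to the power 0 is 1
NA | TRUE    # anything OR TRUE is TRUE
FALSE & NA   # FALSE AND anything is FALSE

# Missing: the answer depends on the unknown value.
NA * 0       # would be 0 for a finite number but NaN for Inf
Inf * 0      # NaN, which is why NA * 0 cannot be pinned down
NA & TRUE    # TRUE if the unknown is TRUE, FALSE otherwise
```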
Note: For some context, see this thread
arrange()
arrange(flights, year, month, day)
arrange(flights, desc(dep_delay))
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
arrange(df, desc(x))
1: How could you use arrange() to sort all missing values to the start? (Hint: use is.na()). Note: This one should still have the earliest departure dates after the NAs. Hint: What does desc() do?
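arrange(df, desc(is.na(x)), x)
is.na(x) is TRUE for missing values, and desc() sorts TRUE ahead of FALSE, so the NAs come first. Adding x as a second sort key keeps the remaining rows in ascending order.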
2: Sort flights to find the most delayed flights. Find the flights that left earliest.
arrange(flights, desc(dep_delay))
arrange(flights, dep_delay)
This question is asking for the flights that were most delayed (left latest after scheduled departure time) and least delayed (left ahead of scheduled time).
3: Sort flights to find the fastest flights. Interpret fastest to mean shortest time in the air.
arrange(flights, air_time)
Optional challenge: fastest flight could refer to fastest air speed. Speed is measured in miles per hour but time is minutes. Arrange the data by fastest air speed.
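One possible approach to the optional challenge, sketched on a made-up table (toy and its three rows are hypothetical): dividing distance in miles by air_time in minutes and multiplying by 60 gives miles per hour, and the same arrange() call should work with flights in place of toy.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for flights: distance in miles, air_time in minutes.
toy <- tibble(
  flight   = c("A", "B", "C"),
  distance = c(500, 1000, 300),
  air_time = c(60, 150, 30)
)

# miles / minutes * 60 = miles per hour; desc() puts the fastest first.
by_speed <- arrange(toy, desc(distance / air_time * 60))
by_speed
```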
4: Which flights travelled the longest? Which travelled the shortest?
arrange(flights, desc(distance))
arrange(flights, distance)
select()
select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))
rename(flights, tail_num = tailnum)
select(flights, time_hour, air_time, everything())
1: Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights. Find at least three ways.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
flights %>% select(dep_time, dep_delay, arr_time, arr_delay)
select(flights, c(dep_time, dep_delay, arr_time, arr_delay))
2: What happens if you include the name of a variable multiple times in a select() call?
select(flights, dep_time, dep_time)
It only selects/lists the variable once, since it is the same variable.
3: What does the one_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
flights %>% select(one_of(vars))
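one_of() selects the variables whose names are stored in a character vector, here year, month, day, dep_delay, and arr_delay. This is helpful because the column names can be built or stored elsewhere in the code and then passed to select() as strings instead of being typed out as bare names.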
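A minimal sketch of the same idea on a made-up table (toy and picked are hypothetical names): the vector of names is defined separately and handed to select() through one_of().

```r
library(dplyr)
library(tibble)

# Hypothetical small table standing in for flights.
toy <- tibble(year = 2013, month = 1, day = 1, dep_delay = 2, carrier = "UA")

# Column names stored as strings, for example built earlier in a script.
vars <- c("year", "month", "dep_delay")

# one_of() selects the columns whose names appear in the vector.
picked <- select(toy, one_of(vars))
names(picked)
```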
4: Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
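select(flights, contains("TIME"))
The code has to be entered without surrounding backticks; with them, R looks for a single object with that whole name and stops with an “object not found” error. The result is a little surprising: contains("TIME") selects all six columns with “time” in their names (dep_time, sched_dep_time, arr_time, sched_arr_time, air_time, and time_hour) even though those names are lowercase. By default the select helpers are not case sensitive. You can change this default with ignore.case = FALSE, as in contains("TIME", ignore.case = FALSE).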
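The effect of ignore.case can be seen on a small made-up table (toy is hypothetical and has only three columns):

```r
library(dplyr)
library(tibble)

toy <- tibble(dep_time = 517, air_time = 227, carrier = "UA")

# Default: matching ignores case, so "TIME" matches dep_time and air_time.
names(select(toy, contains("TIME")))

# ignore.case = FALSE makes the match case sensitive, so nothing matches.
names(select(toy, contains("TIME", ignore.case = FALSE)))
```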